Mycoplasma, which was used to create the first "synthetic life", has been an important species in the 
emerging field of synthetic biology. However, essential genes, an important concept in synthetic biology, for 
both M. mycoides and M. capricolum, as well as 14 other Mycoplasma with available genomes, are still 
unknown. We have developed a gene essentiality prediction algorithm that incorporates information on 
biased gene strand distribution, homology search and codon adaptation index. The algorithm, which 
achieved an accuracy of 80.8% and 78.9% in self-consistence and cross-validation tests, respectively, 
predicted 5880 essential genes in the 16 Mycoplasma genomes. The intersection set of essential genes in 
available Mycoplasma genomes consists of 153 core essential genes. The predicted essential genes (available 
from pDEG, tubic.tju.edu.cn/pdeg) and the proposed algorithm can be helpful for studying minimal 
Mycoplasma genomes as well as essential genes in other genomes. 

The year 2010 saw the creation of the first artificial self-replicating bacterial cells 1. In this famous work, 
Venter's group designed, synthesized and assembled JCVI-syn1.0, a 1.08 Mb Mycoplasma mycoides genome, 
which was then transplanted into an M. capricolum recipient cell. These efforts resulted in the creation of new 
M. mycoides cells whose genetic material consists only of the synthetic chromosome 1. This is a technical milestone 
in the emerging field of synthetic biology because, conceptually, it means that synthetic life can be designed and made 2. 

An important concept of synthetic biology is the minimal genome, which contains all essential genes of an 
organism 3,4. The minimal genome can serve as a chassis into which interchangeable elements are inserted to create 
organisms with desirable traits 5-7. Mycoplasma has been an important species for synthetic biology, mainly 
because of its small genome size. The first genome-scale gene essentiality screen was performed in a 
Mycoplasma genome 8. However, the essential genes for both M. mycoides and M. capricolum, as well as those 
for 14 other Mycoplasma with available genomes, are not known. The goal of the current study was to develop a 
novel and reliable algorithm to predict essential genes in the 16 Mycoplasma genomes. 

Identification of essential genes in silico is important and necessary, not only because their experimental 
determination is highly labor-intensive and time-consuming, but also because the speed of genome sequencing 
far outpaces that of genome-wide gene essentiality studies. Although experimental techniques for identifying 
essential genes have improved dramatically, genome-wide gene essentiality data are available for only 15 
bacterial genomes 9. In contrast, the number of available genomes has reached 1000, and projects to sequence 
4000 more bacterial genomes are underway. With the increasing capacity for genome sequencing, the in silico 
prediction of essential genes will become increasingly important. 

Various algorithms have been proposed to predict essential genes. Most algorithms are based on various 
genomic features, which include connectivity in protein-protein interaction networks, fluctuation in mRNA 
expression, evolutionary rate, phylogenetic conservation, GC content, codon adaptation index (CAI), predicted 
sub-cellular localization and codon usage 10-16. Because bacterial essential gene products are attractive drug 
targets for developing antibiotics, some studies aim to identify essential genes that could serve as drug 
targets. These studies mainly rely on homology searches against available essential genes, for instance 
against DEG (Database of Essential Genes) 9,17, based on the notion that genes homologous to 
known essential genes are likely to be essential as well. The bacterial pathogens studied include Pseudomonas 
aeruginosa 18, Burkholderia pseudomallei 19, H. pylori 20, Aeromonas hydrophila 21, Neisseria gonorrhoeae 22, Aeromonas 
hydrophila 23 and Wolbachia 24. Very recently, Duffield and coworkers, using a modified down-selection 
computational tool, predicted 52 essential genes that are conserved in 7 or more genomes in DEG, and 7 of 
the 8 genes that were experimentally tested in Yersinia pseudotuberculosis were found to be essential 25. 

Essential genes are known to be unevenly distributed between the leading and lagging strands in E. coli and B. 
subtilis 26. We subsequently confirmed this phenomenon in 10 genomes in which gene essentiality screens had been 
performed 27. However, information on the biased distribution of essential genes 
has not been effectively integrated into gene essentiality 
prediction programs. With the availability of DoriC 28, a 
database that contains replication origins for almost all bacterial 
genomes, such information (gene distribution in leading and lagging 
strands), if used effectively, will be helpful for essential 
gene prediction in most bacterial genomes. 

We developed an algorithm that integrates information on the 
biased distribution of essential genes in leading and lagging strands 
with homology searches and CAI values. The algorithm, 
which is simple and reliable, achieved an accuracy of 80.8% in 
predicting essential genes in the M. pulmonis genome (self-consistence test), 
and accuracies of 78.9% and 78.1% in predicting those in the 
S. aureus and Bacillus subtilis genomes, respectively (cross-validation 
tests). We then predicted 5880 essential genes in 16 
Mycoplasma genomes. Detailed information on these genes is 
organized into a database of predicted essential genes (pDEG) 
(http://tubic.tju.edu.cn/pdeg). The intersection set of essential genes in 18 
Mycoplasma genomes (5880 predicted in the 16 Mycoplasma 
genomes, 379 and 310 experimentally determined in M. genitalium and 
M. pulmonis, respectively) consists of 153 core essential genes. The 
proposed algorithm and the prediction results will be helpful for 
studying essential genes in Mycoplasma as well as in other genomes. 
In particular, they are helpful for designing various Mycoplasma chassis 
used in synthetic biology. 

Results 

Training procedure and the self-consistence test. The training set 
included 379 and 310 essential genes for M. genitalium G37 (M. gen) 
and M. pulmonis UAB CTIP (M. pul), respectively. The training 
procedure could be performed in one of two ways: essential 
genes of M. pul are predicted based on those of M. gen; or conversely, 
essential genes of M. gen are predicted based on those of M. pul. Since 
the average size of the 16 Mycoplasma genomes is about 1 Mb (see 
Table 1), the M. gen genome did not seem to be a suitable 
representative, because it has the smallest genome size (0.58 Mb). 
Therefore, we chose to train the parameters in the first 
way, i.e., essential genes of M. pul (genome size about 1 Mb) 
were predicted based on the experimentally determined ones of M. 
gen. The highest prediction accuracy achieved in the training 
procedure represents the self-consistence test accuracy that the 
present algorithm can reach. The parameters obtained from 
the training procedure can then be used to predict essential genes 
in the 16 Mycoplasma genomes. 

By comparing the predictions with the essential genes identified 
experimentally in the M. pul genome, the parameters were determined such 
that the prediction accuracy reached its best value. The detailed 
training procedure is described in Fig. 1. We aimed to keep the 
sensitivity S_n roughly equal to the specificity S_p (Fig. 2a). The 
corresponding ROC curve is shown in Fig. 2b, where the AUC (Area 
Under the Curve) value was 0.812. The detailed prediction accuracy 
in terms of leading and lagging strands is listed in Table 2. Overall, 
the accuracy was 80.8% (S_n = 0.78 and S_p = 0.83), which may be 
considered the highest self-consistence test accuracy that the 
present algorithm can reach. 
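
The balancing of sensitivity against specificity and the ROC/AUC summary described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the authors' program: the function names (`sweep_thresholds`, `balanced_threshold`, `roc_auc`) and the toy scores and labels are hypothetical, and the ROC curve is simply closed at the points (0, 0) and (1, 1).

```python
def sweep_thresholds(scores, labels):
    """Return (s0, S_n, S_p) for every candidate threshold s0.

    A gene is called essential when its score s satisfies s > s0.
    labels: 1 for experimentally essential, 0 for non-essential.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    rows = []
    for s0 in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > s0 and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > s0 and y == 0)
        rows.append((s0, tp / pos, (neg - fp) / neg))
    return rows


def balanced_threshold(rows):
    """Pick the threshold where sensitivity and specificity are closest."""
    return min(rows, key=lambda r: abs(r[1] - r[2]))[0]


def roc_auc(rows):
    """Trapezoidal area under the ROC curve, closed at (0, 0) and (1, 1)."""
    points = {(1 - sp, sn) for _, sn, sp in rows} | {(0.0, 0.0), (1.0, 1.0)}
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Toy scores and labels, invented for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
rows = sweep_thresholds(scores, labels)
print(balanced_threshold(rows), roc_auc(rows))
```

In practice the scores would be the prediction parameter s of each gene and the labels would come from the experimental essentiality screen of the training genome.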

Cross-validation test. In addition to the self-consistence test, the 
algorithm should also be evaluated on an independent data set. That 
is, once the parameters are determined, they should be tested on 
a genome whose essential genes have been experimentally 
determined, excluding the M. gen and M. pul genomes. 
However, M. gen and M. pul are so far the only 2 genomes in 
the Mycoplasma family for which genome-wide gene essentiality 
studies have been performed. Therefore, instead of using the 
essential genes of a third Mycoplasma genome, which are 
unavailable, we chose to use two bacterial genomes closely related 
to the two Mycoplasma genomes, Bacillus subtilis str. 168 and 
Staphylococcus aureus N315, whose essential genes were identified 
experimentally 29-31. 

Using the parameters obtained in the training procedure of the algorithm, 
we predicted the essential genes for B. subtilis str. 168 and S. aureus 
N315. We found that, instead of merely using the information of the 
379 essential genes in the M. gen genome, the prediction accuracy can 
be improved by using the combined set of the 379 and 310 essential 
genes in the genomes of M. gen and M. pul, respectively. The prediction 
results are listed in Table 3. The average AUC value is (0.813 
+ 0.778)/2 = 0.796. The average prediction accuracy, (78.1% + 
78.9%)/2 = 78.5%, may be deemed the cross-validation test 



Table 1 | Detailed prediction and related information for the 16 Mycoplasma genomes^a

| Organism | Abbr. | Size (Mb) | GC (%) | Predicted essential genes: Leading | Predicted essential genes: Lagging | Predicted essential genes: Both | Total genes: Leading | Total genes: Lagging | Total genes: Both | RefSeq |
|---|---|---|---|---|---|---|---|---|---|---|
| Mycoplasma agalactiae | Mag | 1.01 | 29.0 | 259 | 118 | 377 | 513 | 300 | 813 | NC_013948 |
| Mycoplasma agalactiae PG2 | MagPG2 | 0.88 | 29.7 | 253 | 115 | 368 | 452 | 290 | 742 | NC_009497 |
| Mycoplasma arthritidis 158L3-1 | Mar | 0.82 | 30.7 | 215 | 103 | 318 | 386 | 245 | 631 | NC_011025 |
| Mycoplasma capricolum subsp. capricolum ATCC 27343 | Mca | 1.01 | 23.8 | 282 | 78 | 360 | 591 | 221 | 812 | NC_007633 |
| Mycoplasma conjunctivae HRC/581 | Mco | 0.85 | 28.6 | 218 | 108 | 326 | 469 | 222 | 691 | NC_012806 |
| Mycoplasma crocodyli MP145 | Mcr | 0.93 | 27.0 | 232 | 127 | 359 | 404 | 285 | 689 | NC_014014 |
| Mycoplasma gallisepticum str. R(low) | Mga | 1.01 | 31.5 | 341 | 72 | 413 | 604 | 159 | 763 | NC_004829 |
| Mycoplasma genitalium G37 | Mge | 0.58 | 31.7 | **317** | **62** | **379** | 385 | 92 | 477 | NC_000908 |
| Mycoplasma hominis | Mho | 0.67 | 27.0 | 219 | 91 | 310 | 343 | 180 | 523 | NC_013511 |
| Mycoplasma hyopneumoniae 232 | Mhy232 | 0.89 | 28.6 | 187 | 156 | 343 | 366 | 325 | 691 | NC_006360 |
| Mycoplasma hyopneumoniae 7448 | Mhy7448 | 0.92 | 28.5 | 183 | 163 | 346 | 346 | 311 | 657 | NC_007332 |
| Mycoplasma hyopneumoniae J | MhyJ | 0.90 | 28.5 | 185 | 161 | 346 | 343 | 314 | 657 | NC_007295 |
| Mycoplasma mobile 163K | Mmo | 0.78 | 25.0 | 245 | 118 | 363 | 401 | 232 | 633 | NC_006908 |
| Mycoplasma mycoides subsp. mycoides SC str. PG1 | Mmy | 1.21 | 24.0 | 286 | 115 | 401 | 647 | 369 | 1016 | NC_005364 |
| Mycoplasma penetrans HF-2 | Mpe | 1.36 | 25.7 | 344 | 56 | 400 | 849 | 188 | 1037 | NC_004432 |
| Mycoplasma pneumoniae M129 | Mpn | 0.82 | 40.0 | 404 | 90 | 494 | 546 | 143 | 689 | NC_000912 |
| Mycoplasma pulmonis UAB CTIP | Mpu | 0.96 | 26.6 | **208** | **102** | **310** | 484 | 298 | 782 | NC_002771 |
| Mycoplasma synoviae 53 | Msy | 0.80 | 28.5 | 202 | 154 | 356 | 334 | 325 | 659 | NC_007294 |

^a Bold figures denote essential genes that are experimentally identified. Note the biased distribution of essential genes between leading and lagging strands.

Figure 1 | The flow chart of the proposed algorithm in training and prediction phases. 

Figure 2 | Accuracy indices and the ROC curve for the current algorithm. 
(A) Sensitivity, specificity and positive prediction rate in relation to the 
parameter s defined in eq. (8). The value of s (s > s_0) was chosen such that 
the sensitivity S_n is roughly equal to the specificity S_p. (B) The ROC curve 
(blue) and AUC (Area Under the Curve). The red line denotes an 
extrapolation of the ROC curve to the point where 1 - S_p = 1. The AUC 
value is found to be 0.812. 

accuracy of the present algorithm. Because these two genomes do not 
belong to the Mycoplasma family, it is likely that the overall accuracy of 
the present algorithm in predicting Mycoplasma essential genes lies in 
the interval 78.5% < accuracy ≤ 80.8%. For some of the 16 
Mycoplasma genomes under study, it is possible that the prediction 
accuracy exceeds 80.8%, because they are much more closely related 
to M. gen and M. pul than B. subtilis and S. aureus are. 

Prediction of essential genes in the 16 Mycoplasma genomes. 

Based on the parameters obtained in the training procedure and 
the aggregate set of the 379 and 310 essential genes for M. 
genitalium G37 and M. pulmonis UAB CTIP, respectively, essential 
genes for the 16 Mycoplasma genomes were predicted. A total of 5880 
essential genes were predicted, with on average 368 essential genes in 
each genome. The overall prediction results are listed in Table 1. The 



Table 3 | The cross-validation test accuracy^a

| Organism | Strand | S_n | S_p | A |
|---|---|---|---|---|
| Bacillus subtilis 168 | Leading | 69.8% | 86.6% | 78.2% |
| | Lagging | 36.8% | 94.1% | 65.5% |
| | Both | 67.5% | 88.7% | **78.1%** |
| Staphylococcus aureus N315 | Leading | 73.3% | 85.5% | 79.4% |
| | Lagging | 41.4% | 93.3% | 67.3% |
| | Both | 70.2% | 87.6% | **78.9%** |

^a Bold figures denote the overall prediction accuracy.

detailed information for each predicted essential gene is 
described in a database of predicted essential genes (pDEG), which 
is accessible from the website http://tubic.tju.edu.cn/pdeg/. The 
database pDEG is organized in the same form as DEG. In 
pDEG, the detailed information on all the predicted essential genes 
can be obtained, including their names, functions, DNA and protein 
sequences and COG codes. If a predicted essential gene codes for an 
enzyme, the EC number and the KEGG link 32 describing the 
metabolic pathway involved are also provided. Users can search 
for predicted essential genes by function and name, and 
can also browse and download all the records in pDEG. 

Core essential genes for the Mycoplasma family. The phylogenetic 
tree of the 18 Mycoplasma genomes was drawn based on the 16S 
rRNA (Fig. 3), where the abbreviations of the 18 genomes are given in 
Table 1. We then obtained the intersection sets of genes and of essential 
genes based on reciprocal homology searches between genomes. For 
example, the number of intersection genes between the genomes of 
M. mycoides and M. capricolum was 679. The number of overall 
intersection genes among the 18 Mycoplasma genomes was 191. 
Similarly, the numbers of intersection essential genes between two 
genomes or two genome clusters are shown in Fig. 3b. Note that the 
essential genes of the M. genitalium and M. pulmonis genomes were 
identified experimentally, whereas the essential genes of the remaining 
16 genomes were predicted in the present study. 

The intersection set of the essential genes in the 18 Mycoplasma 
genomes (5880 predicted in the 16 Mycoplasma genomes, 379 and 
310 experimentally determined in M. genitalium and M. pulmonis, 
respectively) consists of 153 genes, which are called core essential 
genes for the Mycoplasma family. The core essential genes likely 
encode functions that are absolutely required for the survival of 
Mycoplasma, and their homologues in other bacteria likely have 
critical functions as well. Detailed information of the 153 core essen- 
tial genes is available from pDEG. 

Discussion 

Essential genes are those indispensable for the survival of an 
organism under certain conditions, and the essential-gene concept is 
especially important for the burgeoning field of synthetic biology. A goal in 
synthetic biology is to develop a cellular chassis which, 
composed of essential genes, contains all necessary components for cell 
survival. Based on the chassis, other gene circuits can be inserted to 
create experimental organisms with desirable traits that serve human 
needs. We here put forward two concepts: pan essential genes and 



Table 2 | The self-consistence test accuracy^a

| Organism | Strand | S_n | S_p | S_+ | A |
|---|---|---|---|---|---|
| Mycoplasma pulmonis UAB CTIP (M. pul) | Leading | 80.3% | 82.6% | 77.7% | 81.4% |
| | Lagging | 74.5% | 84.2% | 71.0% | 79.3% |
| | Both | 78.4% | 83.3% | 75.5% | **80.8%** |

^a The bold figure denotes the overall prediction accuracy.

Figure 3 | The phylogenetic tree of the 18 Mycoplasma genomes based on the 16S rRNA. The intersection set of (A) genes and (B) essential genes in the 
18 Mycoplasma genomes. The numbers on the left indicate gene numbers in intersection sets between genomes, whereas those on the right denote the total 
gene number in a genome. The intersection set of the 5880 predicted essential genes and those experimentally identified in the M. genitalium and M. pulmonis 
genomes consists of 153 core essential genes for the Mycoplasma family. 



core essential genes. For Mycoplasma species, pan essential genes are 
the combined essential gene set, while core essential genes are the 
intersection set of essential genes among Mycoplasma species. Based 
on the current dataset, the number of Mycoplasma pan essential 
genes is 6569 (5880 predicted, 379 and 310 experimentally deter- 
mined in M. genitalium and M. pulmonis, respectively). However, 
we hypothesize that although the number of pan essential genes will 
continue to increase with more Mycoplasma genomes, the number of 
core essential genes (153) will largely remain the same. The core 
essential genes are likely needed for all Mycoplasma genomes and 
are likely all needed for the Mycoplasma chassis. 
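
The distinction between pan and core essential genes can be illustrated with a small Python sketch. It assumes that homologous genes have already been mapped to shared labels (for example by the reciprocal homology searches mentioned in the Results); the three genomes, their gene lists and the label "hypothetical_1" are invented for illustration and do not reproduce the pDEG data. The pan count is taken as the sum over genomes, mirroring how 6569 = 5880 + 379 + 310 is obtained in the text.

```python
from functools import reduce

# Hypothetical essential-gene sets, keyed by genome abbreviation.
# Homologous genes are assumed to carry the same label in every genome.
essential = {
    "Mge": {"dnaA", "gyrA", "rpoB", "tuf", "hypothetical_1"},
    "Mpu": {"dnaA", "gyrA", "rpoB", "tuf", "ftsZ"},
    "Mmy": {"dnaA", "gyrA", "rpoB", "atpA"},
}


def core_essential(sets_by_genome):
    """Intersection of the essential-gene sets of all genomes."""
    return reduce(set.intersection, sets_by_genome.values())


def pan_essential_count(sets_by_genome):
    """Combined essential-gene count, summed over genomes."""
    return sum(len(genes) for genes in sets_by_genome.values())


print(sorted(core_essential(essential)))
print(pan_essential_count(essential))
```

Adding a further genome to `essential` can only keep the core set the same or shrink it, while the pan count can only grow, which is the behaviour hypothesized above.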

Indeed, the core essential genes are generally functionally import- 
ant, and are involved in critical cellular processes. Based on COG 
functional classification 33 , core essential genes, compared to non- 
core essential and non-essential genes, had a higher proportion of 
genes involved in information storage and processing (Fig. 4a), and 
most of the core ones (55%) are involved in translation, ribosomal 
structure, transcription and replication (Fig. 4b). For example, they 
include most genes coding for 30S, 50S ribosomal proteins and ami- 
noacyl-tRNA synthetases. They include those involved in replica- 
tion, such as replication initiation protein (dnaA), replication 
DNA helicase (dnaB), DNA gyrase subunit A (gyrA) and subunit B 
(gyrB), DNA ligase (ligA), DNA polymerase III subunit-related 
proteins (dnaX, polC) and DNA primase (dnaG). They include genes of 4 
protein synthesis elongation factors G, P, Ts and Tu (fusA, efp, tsf and 
tuf) and 2 translation initiation factors IF-2 (infB) and IF-3 (infC), 
and transcription-related genes, such as DNA-directed RNA 
polymerase subunit alpha (rpoA) and beta (rpoB) and RNA polymerase 
sigma factor RpoD (rpoD). They also include almost all subunits of 
F0F1 ATP synthase (atpA, atpB, atpD, atpE and atpG) and many 
enzymes involved in energy production and metabolism. For details, 
refer to http://tubic.tju.edu.cn/pdeg/core/. 

It is noteworthy that some core essential genes do not have clearly 
defined functions. For instance, MG_423 encodes a hypothetical 
protein (accession number NP_073094) in the M. genitalium 
genome. Blast searches suggested that this gene likely encodes 
ribonuclease J, which plays a key role in mRNA degradation 34. Being a core 
essential gene makes it a priority for further functional 
characterization. 

In summary, we have predicted essential genes of the 16 
Mycoplasma genomes currently available in GenBank, based on 
experimentally identified essential genes of the M. genitalium and 
M. pulmonis genomes. The algorithm is simple and effective. The 
cross-validation test shows that the sensitivity S_n and the 
specificity S_p of the algorithm are both roughly equal to 80%. This 
accuracy means that about 80% of the essential genes in the 
Mycoplasma genomes under study are correctly predicted as 
essential; likewise, about 80% of the non-essential genes in these 
genomes are correctly predicted as non-essential. The high 
accuracy achieved is mainly due to the homology mapping among 
evolutionarily closely related bacteria, together with other 
information including the biased distribution of essential genes in leading and 
lagging strands and CAI values. Mycoplasma has been an 
important species in the field of synthetic biology. The prediction results 
and the proposed algorithm can be useful in studying the 
minimal genomes of Mycoplasma, and in gene essentiality studies for 
other genomes. In particular, they are helpful for designing various 
Mycoplasma chassis used in synthetic biology. 
Figure 4 | Functional classification of genes in the M. genitalium genome based on COG. (A) COG classification of core-essential, non-core-essential 
and non-essential genes in M. genitalium. (B) Distribution of COG classification of the 153 core-essential genes. 



Methods 

The genomic RefSeq protein sequences for all the 18 Mycoplasma genomes were 
downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). The 
alignment program BLAST+ was downloaded from the same website (version 
2.2.23+, ftp://ftp.ncbi.nih.gov/blast) 35. There are 379 and 310 experimentally 
determined essential genes for M. genitalium G37 36 and M. pulmonis UAB CTIP 37, 
respectively. It is noteworthy that the definition of essential genes depends on the 
experimental conditions, such as growth in rich medium 38. In addition, synthetic 
lethality (lethality due to inactivation of more than 1 gene) is not considered in single 
gene knockout experiments. The detailed information for each of the 18 Mycoplasma 
genomes is listed in Table 1. 

The following parameters were used in the present study to assess the performance of 
the algorithm:

$$S_n = \frac{TP}{TP + FN}, \tag{1}$$

$$S_p = \frac{TN}{TN + FP}, \tag{2}$$

$$S_+ = \frac{TP}{TP + FP}, \tag{3}$$

$$A = \frac{S_n + S_p}{2}, \tag{4}$$

where TP, FN, FP and TN denote true positives, false negatives, false positives and true 
negatives, respectively. The sensitivity S_n represents the proportion of essential genes 
that have been correctly predicted as essential. The specificity S_p represents the 
proportion of non-essential genes that have been correctly predicted as non-essential. 
The positive prediction rate S_+ represents the percentage of essential genes among the 
predicted ones. The accuracy A is the average of the sensitivity and specificity. 
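
As a worked illustration of the four performance indices defined above, the following Python sketch computes them from confusion-matrix counts. The function name `accuracy_indices` and the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def accuracy_indices(tp, fn, fp, tn):
    """Return (S_n, S_p, S_plus, A) from confusion-matrix counts."""
    s_n = tp / (tp + fn)       # sensitivity: essential genes recovered
    s_p = tn / (tn + fp)       # specificity: non-essential genes recovered
    s_plus = tp / (tp + fp)    # positive prediction rate
    a = (s_n + s_p) / 2        # accuracy: mean of sensitivity and specificity
    return s_n, s_p, s_plus, a


# Hypothetical counts, for illustration only
s_n, s_p, s_plus, a = accuracy_indices(tp=80, fn=20, fp=30, tn=170)
print(f"S_n={s_n:.3f}  S_p={s_p:.3f}  S_+={s_plus:.3f}  A={a:.3f}")
```

Note that A defined this way is a balanced accuracy, so it is not inflated when non-essential genes greatly outnumber essential ones.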

The prediction is partially based on the alignment of protein primary sequences to 
be predicted against those from closely related organisms in DEG, using the program 
Blastp. For each query protein sequence, we define 

$$e = \begin{cases} 1, & \text{if } E = 0, \\ \dfrac{\log E}{\log E_{\min}}, & \text{if } E \neq 0, \end{cases} \tag{5}$$

where E is the expectation value of the best scoring alignment in Blastp (with default 
parameters), and E_min is the smallest E value other than 0 of all genes from the 18 
Mycoplasma genomes. 

The prediction is also based on the strand bias of essential genes 26. We define 

$$b = \begin{cases} 1 + \beta_1, & \text{if the gene to be predicted is at the leading strand;} \\ 1 + \beta_2, & \text{if the gene to be predicted is at the lagging strand,} \end{cases} \tag{6}$$

where b is a real number and β_1, β_2 ∈ [-1, 1]. The replication origin and terminus are 
determined based on the DoriC database 28. 
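
Once the replication origin and terminus are known, assigning a gene to the leading or lagging strand is a matter of comparing its coding strand with the direction of the replication fork that passes over it. The Python sketch below shows one common way to do this on a circular chromosome. It is an assumption-laden illustration rather than the authors' code: the function name `is_leading`, the 0-based coordinates, and the use of the gene start as the gene position are all hypothetical choices, and genes spanning the origin or terminus are not treated specially.

```python
def is_leading(gene_start, gene_strand, ori, ter, genome_len):
    """Return True if a gene lies on the leading strand.

    gene_start, ori, ter: 0-based positions on a circular chromosome.
    gene_strand: '+' (transcribed towards increasing coordinates) or '-'.
    """
    # Distance from the origin, walking in the direction of increasing coordinates
    d_gene = (gene_start - ori) % genome_len
    d_ter = (ter - ori) % genome_len
    # Genes met before the terminus are replicated by the fork that moves
    # towards increasing coordinates; the rest by the opposite fork.
    fork_moves_forward = d_gene < d_ter
    # A gene is on the leading strand when transcription and the fork are co-directional
    return (gene_strand == "+") == fork_moves_forward


# Hypothetical 1000-bp chromosome with origin at 100 and terminus at 600
print(is_leading(300, "+", ori=100, ter=600, genome_len=1000))
print(is_leading(800, "-", ori=100, ter=600, genome_len=1000))
```

The boolean result can then be used to select between β_1 and β_2 when computing the strand term b.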

Finally, the prediction is also partially based on the CAI value of a gene to be 
predicted 13,14,16. The CAI values were calculated using the CodonW software 
(http://codonw.sourceforge.net). We define 

$$c = 1 + \gamma \times \frac{\mathrm{CAI} - \overline{\mathrm{CAI}}}{\overline{\mathrm{CAI}}}, \tag{7}$$

where c is a real number and γ ∈ [0, 1]. Accordingly, we define the prediction 
parameter s by 

$$s = e \times b \times c. \tag{8}$$

Using an iterative procedure (Fig. 1), the parameters β and γ were determined 
based on the training set. For each gene to be predicted we calculate the set of 
parameters (e, b, c), and finally the prediction parameter s. We further look for a 
threshold s_0 such that if s > s_0, the gene is predicted to be essential; otherwise, if s < s_0, 
the gene is predicted to be non-essential. Detailed prediction results are available from 
the website http://tubic.tju.edu.cn/pdeg/, and programs are available upon request. 
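
To make the scoring scheme concrete, here is a minimal Python sketch of how a prediction parameter s = e × b × c could be computed and thresholded. It is not the authors' program. Two points are assumptions: the homology term is taken to be log E / log E_min for E ≠ 0, and the CAI term is taken relative to a reference CAI value (`cai_ref`, for example a genome-wide mean). All function names and all numeric values in the example (E values, CAI values, β_1, β_2, γ and s_0) are hypothetical.

```python
import math


def homology_term(e_value, e_min):
    """e: 1 for a perfect hit (E = 0), else log E / log E_min (assumed form)."""
    if e_value == 0:
        return 1.0
    return math.log(e_value) / math.log(e_min)


def strand_term(leading, beta1, beta2):
    """b: 1 + beta1 on the leading strand, 1 + beta2 on the lagging strand."""
    return 1 + (beta1 if leading else beta2)


def cai_term(cai, cai_ref, gamma):
    """c: 1 + gamma * (CAI - reference CAI) / reference CAI (assumed form)."""
    return 1 + gamma * (cai - cai_ref) / cai_ref


def prediction_score(e_value, e_min, leading, cai, cai_ref, beta1, beta2, gamma):
    """s = e * b * c."""
    return (homology_term(e_value, e_min)
            * strand_term(leading, beta1, beta2)
            * cai_term(cai, cai_ref, gamma))


def is_essential(s, s0):
    """A gene is predicted essential when s exceeds the threshold s0."""
    return s > s0


# Hypothetical gene and parameters, for illustration only
s = prediction_score(e_value=1e-50, e_min=1e-180, leading=True,
                     cai=0.6, cai_ref=0.5, beta1=0.2, beta2=-0.2, gamma=0.5)
print(round(s, 4), is_essential(s, s0=0.3))
```

Because the three terms are multiplied, a gene with no significant homolog in DEG (e close to 0) receives a low score regardless of its strand or CAI, so homology dominates and the other two terms act as modifiers.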